Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
We develop a learning principle and an efficient algorithm for batch learning
from logged bandit feedback. This learning setting is ubiquitous in online
systems (e.g., ad placement, web search, recommendation), where an algorithm
makes a prediction (e.g., ad ranking) for a given input (e.g., query) and
observes bandit feedback (e.g., user clicks on presented ads). We first address
the counterfactual nature of the learning problem through propensity scoring.
Next, we prove generalization error bounds that account for the variance of the
propensity-weighted empirical risk estimator. These constructive bounds give
rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM
can be used to derive a new learning method -- called Policy Optimizer for
Exponential Models (POEM) -- for learning stochastic linear rules for
structured output prediction. We present a decomposition of the POEM objective
that enables efficient stochastic gradient optimization. POEM is evaluated on
several multi-label classification problems showing substantially improved
robustness and generalization performance compared to the state-of-the-art.
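To make the CRM objective concrete, here is a minimal NumPy sketch of the
clipped propensity-weighted risk with the variance penalty that the bounds
motivate. It is an illustration, not the authors' POEM implementation; the
array names delta (logged losses), p_new (the candidate policy's
probabilities of the logged actions), p_log (logging propensities), and the
constants lam and clip are assumptions for the example.

```python
import numpy as np

# Sketch of the CRM objective on logged bandit data (hypothetical inputs):
#   delta  -- loss observed for each logged action
#   p_new  -- candidate policy's probability of that action
#   p_log  -- logging policy's propensity for that action
def crm_objective(delta, p_new, p_log, lam=0.5, clip=100.0):
    w = np.minimum(p_new / p_log, clip)        # clipped importance weights
    r = delta * w                              # per-example weighted losses
    risk = r.mean()                            # propensity-weighted empirical risk
    penalty = np.sqrt(r.var(ddof=1) / len(r))  # empirical-variance term
    return risk + lam * penalty                # CRM: risk plus variance penalty
```

Dropping the penalty recovers plain propensity-weighted empirical risk
minimization; the variance term is what the generalization bounds add,
penalizing hypotheses whose risk estimates are unreliable.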
Estimating Position Bias without Intrusive Interventions
Presentation bias is one of the key challenges when learning from implicit
feedback in search engines, as it confounds the relevance signal. While it was
recently shown how counterfactual learning-to-rank (LTR) approaches
[Joachims et al. 2017] can provably overcome presentation bias when
observation propensities are known, it remains to show how to effectively
estimate these propensities. In this paper, we propose the first method for
producing consistent propensity estimates without manual relevance judgments,
disruptive interventions, or restrictive relevance modeling assumptions. First,
we show how to harvest a specific type of intervention data from historic
feedback logs of multiple different ranking functions, and show that this data
is sufficient for consistent propensity estimation in the position-based model.
Second, we propose a new extremum estimator that makes effective use of this
data. In an empirical evaluation, we find that the new estimator provides
superior propensity estimates in two real-world systems -- Arxiv Full-text
Search and Google Drive Search. Beyond these two real-world systems,
simulation studies show that the method is robust across a wide range of
settings.
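To see why harvested interventions suffice, recall the position-based model:
P(click | q, d at rank k) = p_k * rel(q, d). When different historical
rankers place the same query-document pairs at different ranks, the relevance
factor cancels in a ratio of click-through rates. The sketch below is a naive
single-pivot version of this idea with hypothetical aggregated inputs; the
paper's extremum estimator instead pools evidence across all rank pairs.

```python
import numpy as np

# Naive propensity estimation from harvested interventions (hypothetical
# inputs): clicks[k] and impressions[k] count outcomes at rank k for the
# pool of (query, doc) pairs that also appeared at the top rank under some
# other historical ranker, so the relevance distribution per rank is matched.
def relative_propensities(clicks, impressions):
    ctr = clicks / impressions   # click-through rate at each rank
    return ctr / ctr[0]          # p_k / p_1, with the top rank normalized to 1
```

Only relative propensities are needed for counterfactual LTR, which is why
normalizing the top rank to one is enough.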
Unbiased Learning for the Causal Effect of Recommendation
Increasing users' positive interactions, such as purchases or clicks, is an
important objective of recommender systems. Recommenders typically aim to
select items that users will interact with. If the recommended items are
purchased, an increase in sales is expected. However, the items could have been
purchased even without recommendation. Thus, we want to recommend items that
result in purchases caused by the recommendation. This can be formulated as a
ranking problem in terms of the causal effect. Despite its importance, this
problem has not been well explored in prior research. It is challenging
because the ground truth of causal effect is unobservable, and estimating the
causal effect is prone to the bias arising from currently deployed
recommenders. This paper proposes an unbiased learning framework for the causal
effect of recommendation. Based on the inverse propensity scoring technique,
the proposed framework first constructs unbiased estimators for ranking
metrics. Then, it conducts empirical risk minimization on the estimators with
propensity capping, which reduces variance under finite training samples. Based
on the framework, we develop an unbiased learning method for the causal effect
extension of a ranking metric. We theoretically analyze the unbiasedness of the
proposed method and empirically demonstrate that the proposed method
outperforms biased learning methods in various settings.
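A minimal sketch of such an estimator with capping, under assumed inputs,
could look as follows; the names y (observed outcome, e.g., purchase), z
(whether the item was recommended), p (the deployed recommender's propensity
to recommend), and the cap value are illustrative, not the paper's exact
formulation.

```python
import numpy as np

# Capped inverse-propensity estimate of the causal effect of recommendation:
#   tau = E[outcome if recommended] - E[outcome if not recommended]
# y, z, p are hypothetical arrays over user-item observations.
def capped_ips_effect(y, z, p, cap=0.05):
    p_c = np.clip(p, cap, 1.0 - cap)      # propensity capping
    treated = z * y / p_c                 # IPS term for recommended items
    control = (1 - z) * y / (1.0 - p_c)   # IPS term for non-recommended items
    return np.mean(treated - control)     # estimated causal effect
```

The uncapped estimator is unbiased but can blow up when propensities approach
zero or one; capping trades a small bias for the variance reduction the
abstract refers to under finite training samples.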
The Self-Normalized Estimator for Counterfactual Learning
This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem. In the BLBF setting, the learner does not receive full-information feedback as in supervised learning, but observes feedback only for the actions taken by a historical policy. This makes BLBF algorithms particularly attractive for training online systems (e.g., ad placement, web search, recommendation) using their historical logs. The Counterfactual Risk Minimization (CRM) principle [1] offers a general recipe for designing BLBF algorithms. It requires a counterfactual risk estimator, and virtually all existing work on BLBF has focused on a particular unbiased estimator. We show that this conventional estimator suffers from a propensity overfitting problem when used for learning over complex hypothesis spaces. We propose to replace the risk estimator with a self-normalized estimator, showing that it neatly avoids this problem. This naturally gives rise to a new learning algorithm -- Normalized Policy Optimizer for Exponential Models (Norm-POEM) -- for structured output prediction using linear rules. We evaluate the empirical effectiveness of Norm-POEM on several multi-label classification problems, finding that it consistently outperforms the conventional estimator.
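The contrast between the conventional estimator and the self-normalized one
is small in code but large in behavior. A sketch, with the same hypothetical
array names as before (delta for losses, p_new and p_log for new-policy
probabilities and logging propensities):

```python
import numpy as np

# Conventional (unbiased) IPS risk estimator used in prior BLBF work.
def ips_risk(delta, p_new, p_log):
    w = p_new / p_log                     # importance weights
    return np.mean(delta * w)             # shrinks if the policy avoids logged actions

# Self-normalized estimator (SNIPS-style) advocated here.
def snips_risk(delta, p_new, p_log):
    w = p_new / p_log
    return np.sum(delta * w) / np.sum(w)  # weighted average; equivariant to loss shifts
```

The conventional estimator can be driven down simply by putting little
probability on the logged actions, which shrinks the total weight regardless
of the true risk; this is the propensity overfitting the paper identifies.
Dividing by the sum of the weights removes that degenerate solution.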